1. Introduction

This document explores qualitative indicators from an ActivityInfo database that is monitoring Ecuador.

Indicator count totals
Nov 2013 to May 2019
Date        Quantity     Select      Single-line text   Multi-line text   Multi-line text as % of total data collected
Nov 2013       141,442      30,531                  0             6,309   3.54%
June 2015    1,887,857     745,841             85,863            57,128   2.06%
Sept 2016    3,380,991   1,296,548            191,640           116,184   2.33%
May 2017     4,932,977   1,809,419            265,196           168,599   2.35%
May 2019    12,174,327   7,595,829          2,683,945           915,948   3.92%



From the perspective of ActivityInfo, the table shows a clear need for new tools to support the analysis of qualitative data. Between Nov 2013 and May 2019, the absolute volume of multi-line text data grew from 6,309 to 915,948 values, a factor of roughly 145, and its share of all data collected almost doubled from its low of 2.06% in June 2015 to 3.92% in May 2019.
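The percentages and the growth factor can be recomputed from the counts in the table. The following is a minimal Python sketch rather than the code used for this report; the names COUNTS and multiline_share are illustrative assumptions.

```python
# Counts copied from the "Indicator count totals" table.
# Tuple order: quantity, select, single-line text, multi-line text.
COUNTS = {
    "Nov 2013": (141_442, 30_531, 0, 6_309),
    "June 2015": (1_887_857, 745_841, 85_863, 57_128),
    "Sept 2016": (3_380_991, 1_296_548, 191_640, 116_184),
    "May 2017": (4_932_977, 1_809_419, 265_196, 168_599),
    "May 2019": (12_174_327, 7_595_829, 2_683_945, 915_948),
}


def multiline_share(row):
    """Multi-line text values as a percentage of all values in the row."""
    return 100 * row[3] / sum(row)


# Share of multi-line text per date, rounded like the table.
shares = {date: round(multiline_share(row), 2) for date, row in COUNTS.items()}

# Growth of the absolute multi-line text volume over the whole period.
growth = COUNTS["May 2019"][3] / COUNTS["Nov 2013"][3]

for date, share in shares.items():
    print(f"{date}: {share}%")
print(f"growth factor: {growth:.0f}")
```

The share is computed against the sum of the four field types in each row, which is how the last column of the table appears to be defined.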

Data preparation

This section shows what the raw data looks like. The data was extracted from ActivityInfo through the ActivityInfo API and pre-processed to make it ready for analysis. Most of the data extraction and cleaning is done beforehand (please see the R/ folder in the repository, in particular the etl.R and etl-methods.R files). If you want to download the raw data yourself, you need access to the database; the download is then done by sourcing the etl.R file.

The output below gives a glimpse of the raw data, followed by short explanations of the columns:

##    databaseId      databaseName    folderId    folderName      formId
## 1 d0000010297 ECUADOR_MONITOREO f0000021276 Objectivo_1.1 a1424455933
## 2 d0000010297 ECUADOR_MONITOREO f0000021276 Objectivo_1.1 a1424455933
## 3 d0000010297 ECUADOR_MONITOREO f0000021276 Objectivo_1.1 a1424455933
## 4 d0000010297 ECUADOR_MONITOREO f0000021276 Objectivo_1.1 a1424455933
## 5         ...               ...         ...           ...         ...
##   formName  subFormId      subFormName            recordId   Month
## 1    Salud cjs13y74y2 Monthly Sub-Form s0462106109-2019-02 2019-02
## 2    Salud cjs13y74y2 Monthly Sub-Form s1724792277-2019-02 2019-02
## 3    Salud cjs13y74y2 Monthly Sub-Form s0252135214-2019-02 2019-02
## 4    Salud cjs13y74y2 Monthly Sub-Form s1709156312-2019-02 2019-02
## 5      ...        ...              ...                 ...     ...
##          code                                                 question
## 1 Act_4_Ind_1 # de kits de SSR en la emergencia al ministerio de salud
## 2 Act_4_Ind_1 # de kits de SSR en la emergencia al ministerio de salud
## 3 Act_4_Ind_1 # de kits de SSR en la emergencia al ministerio de salud
## 4 Act_4_Ind_1 # de kits de SSR en la emergencia al ministerio de salud
## 5         ...                                                      ...
##   response required     type partnerName        canton
## 1       13    FALSE quantity       UNFPA    HUAQUILLAS
## 2     <NA>    FALSE quantity       ACNUR   SAN LORENZO
## 3     <NA>    FALSE quantity       ACNUR SANTO DOMINGO
## 4     <NA>    FALSE quantity       ACNUR        TULCAN
## 5      ...      ...      ...         ...           ...
##                         province description formNameRecode
## 1                         EL ORO        <NA>          Salud
## 2                     ESMERALDAS        <NA>          Salud
## 3 SANTO DOMINGO DE LOS TSACHILAS        <NA>          Salud
## 4                         CARCHI        <NA>          Salud
## 5                            ...         ...            ...

Some description about the nature of data:

  • databaseId: the internal ActivityInfo id for databases

  • databaseName: the name of databases visible to users

  • folderId: the internal ActivityInfo id for folders

  • folderName: the name of folders visible to users

  • formId: the internal ActivityInfo id for forms

  • formName: the name of forms visible to users

  • subFormId: the internal ActivityInfo id for the sub-forms where the records are kept

  • subFormName: the name of the sub-forms visible to users

  • Month: the month in which a record is entered

  • code: Schema question code

  • question: Question label indicated by the code

  • response: Response given by users

  • required: A boolean value indicating whether the question must be answered.

  • type: internal type for the code. The available types in the data are quantity, NARRATIVE, enumerated.

  • partnerName: The name of reporting partners. The name of implementing partners can be extracted from the data.

  • canton: The canton name of the record.

  • province: The province name of the record.

  • description: the description field further explaining what the question means. The cells are shown as NA when the field does not exist or is not applicable.

Please see ActivityInfo documentation for more information about how the information is structured.


2. Descriptive statistics

Recode form label names

Before we begin, we shorten the form names by recoding them, because the original names are long and clutter the plots. The lookup table below lists the recoded form labels.

formName → formNameRecode [1]
Objectivo_1.1
  Salud → Salud
  Seguridad_alimentaria → AlimentSegur
  Agua, saneamiento e higiene → Agua
  Alojamiento Temporal → Alojamiento
  Transporte humanitario → Transport
  Necesidades básicas/Otro → Necesidades
Objectivo_1.2
  Manejo de la información y entrega directa de la información a la población → Poblacion
  Manejo de la información para socios y análisis de las necesidades → Socios
Objectivo_2.1
  Protección general → General
Objectivo_2.2
  Protección_Infancia → Infancia
  Protección_VBG → VBG
  Trata_y_tráfico → Trafico
  Protection_Otro → Otro
  Protección_LGBTI → LGBTI
Objectivo_3.1
  Acceso_a_educación → Educacion
  Acceso a vivienda y hábitat dignos en comunidades receptoras → Habitat
Objectivo_3.2
  Medios de vida y formación técnico-profesional → Tecnico
  Cohesión_social → SocialCohesion
  Apoyo Educacional a Comunidades Receptoras → Educacional
Objectivo_4.1
  Asistencia técnica para VBG-SSR → VBG_SSR
  Asistencia técnica para protección/gestión de fronteras → Fronteras
Objectivo_4.2
  Asistencia técnica para gestion de la informacion y coordinacion → Coordinacion
Objectivo_4.3
  Asistencia técnica para el sector laboral → SectorLaboral
  Asistencia técnica para protección → Proteccion
  Asistencia técnica para protección de la infancia → ProteccionInfancia
  Asistencia técnica para Salud → AsistenciaSalud
[1] Some short form names (e.g. 'Salud') are kept as they are; there is no need to shorten them further.
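Recoding with a lookup table can be sketched as a dictionary lookup that falls back to the original name. This is an illustrative Python sketch, not the report's own code; RECODE and recode_form_name are assumed names, and only a few of the pairs from the table are included.

```python
# A few formName -> formNameRecode pairs copied from the recode table.
RECODE = {
    "Seguridad_alimentaria": "AlimentSegur",
    "Agua, saneamiento e higiene": "Agua",
    "Alojamiento Temporal": "Alojamiento",
    "Transporte humanitario": "Transport",
    "Protección_VBG": "VBG",
}


def recode_form_name(form_name):
    """Return the short label, or the original name when no recode exists."""
    return RECODE.get(form_name, form_name)


# "Salud" is already short, so it is returned unchanged.
labels = [recode_form_name(name) for name in ("Salud", "Protección_VBG")]
print(labels)
```

Falling back to the original name keeps short names such as 'Salud' untouched without listing them in the table.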

2.1. Partners

There are two types of partners in the database:

  • Reporting partners: Higher level of partners reporting directly in ActivityInfo.

  • Implementing partners: Partners reporting through a reporting partner.

The table below shows the number of records per reporting partner:

  • ACNUR has 982 records, which is 62.0% of the total records.

  • NRC comes second with 138 records, which is 8.7% of the total records.

  • The gap between the two largest partners, ACNUR and NRC, is about 53 percentage points.


Reporting partner Frequency Relative frequency
ACNUR 982 0.620
NRC 138 0.087
PMA 106 0.067
UNICEF 84 0.053
OIM 73 0.046
UNFPA 64 0.040
CARE 29 0.018
Dialogo Diverso 26 0.016
Mision Scalabriniana 18 0.011
ADRA 15 0.009
RET 15 0.009
OPS/OMS 8 0.005
PNUD 7 0.004
JRS Ecuador 5 0.003
Plan Internacional 5 0.003
World Vision 5 0.003
UNESCO 3 0.002
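A frequency table like the one above can be derived by counting the values of the partnerName column. The sketch below is an illustration in Python under stated assumptions: frequency_table is a hypothetical helper, and the column is rebuilt from the published counts, with the 14 smaller partners pooled into a placeholder label "other" for brevity.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, frequency, relative frequency) rows, largest first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, n, round(n / total, 3)) for value, n in counts.most_common()]


# Rebuild a partnerName column from the counts in the table (1,583 records).
partner_column = ["ACNUR"] * 982 + ["NRC"] * 138 + ["PMA"] * 106 + ["other"] * 357

rows = {name: (n, share) for name, n, share in frequency_table(partner_column)}
print(rows["ACNUR"], rows["NRC"], rows["PMA"])
```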

The table below shows the proportion of records entered by partners and sub-partners.

  • 677 of ACNUR's 982 records (68.9%) actually come from HIAS.

  • UNICEF's reporting is spread across more partners: 44% of its records come from HIAS, 25% from ADRA and 25% from UNICEF itself.

  • Under PMA, there are 13 sub-partners. HIAS reports about 41.5% of these records.

These are the total numbers of records across the whole database; they are not specific to the narratives (multi-line text fields). In the next section, we count only the records entered in the narrative fields.


Implementing partner Frequency Relative frequency
ACNUR
HIAS 677 0.689
ACNUR 291 0.296
JRS Ecuador 5 0.005
NRC 4 0.004
ASA 2 0.002
Federación de Mujeres de Sucumbios 2 0.002
Federación de mujeres de Sucumbíos 1 0.001
NRC
NRC 134 0.971
ACNUR 4 0.029
OIM
OIM 73 1.000
UNFPA
UNFPA 62 0.969
RET 2 0.031
PMA
HIAS 44 0.415
ADRA 10 0.094
Buen Pastor 5 0.047
Fundación de Mujeres de Sucumbios 5 0.047
Fundación Tarabita 5 0.047
Hermanas Salesias 5 0.047
Hogar de Cristo 5 0.047
Pastoral Social Cáritas Tulcán 5 0.047
SJR 5 0.047
World Vision 5 0.047
Alas de Colibri 4 0.038
Casa Matilde 4 0.038
Patronato 4 0.038
UNICEF
HIAS 37 0.440
ADRA 21 0.250
UNICEF 21 0.250
NRC 3 0.036
Centro de Desarrollo y Autogestión 2 0.024
CARE
CARE 29 1.000
Dialogo Diverso
Dialogo Diverso 25 0.962
OIM 1 0.038
Mision Scalabriniana
Mision Scalabriniana 18 1.000
ADRA
ADRA 15 1.000
RET
RET 15 1.000
OPS/OMS
OPS/OMS 8 1.000
PNUD
PNUD 7 1.000
JRS Ecuador
JRS Ecuador 5 1.000
Plan Internacional
Plan Internacional 5 1.000
World Vision
World Vision 5 1.000
UNESCO
UNESCO 3 1.000

Reporting and implementing partners

Which reporting and implementing partners report (in all fields)?

[Plots not reproduced here: one panel per reporting partner (ACNUR, ADRA, CARE, Dialogo Diverso, JRS Ecuador, Mision Scalabriniana, NRC, OIM, OPS/OMS, Plan Internacional, PMA, PNUD, RET, UNESCO, UNFPA, UNICEF, World Vision).]

The number of direct and indirect fields

The counts are broken down by form topic, canton and partner.


The number of direct and indirect fields per form topic
¿Implementacion directa o indirecta?
formNameRecode Directa Indirecta NAs Total
Agua 33 17 0 50
Alojamiento 39 24 0 63
Coordinacion 7 36 0 43
Educacion 22 72 0 94
Educacional 1 0 0 1
Fronteras 22 0 0 22
General 115 59 0 174
Habitat 21 13 0 34
Infancia 37 65 6 108
Necesidades 46 74 0 120
Otro 4 53 0 57
Poblacion 71 53 0 124
Proteccion 26 16 0 42
Salud 3 56 6 65
SectorLaboral 1 28 0 29
SocialCohesion 32 14 0 46
Socios 29 0 0 29
Tecnico 28 153 0 181
Trafico 4 0 0 4
Transport 5 0 0 5
VBG 55 72 0 127
VBG_SSR 17 2 0 19

The number of direct and indirect fields per canton
¿Implementacion directa o indirecta?
canton Directa Indirecta NAs Total
AMBATO 3 0 0 3
BAÑOS DE AGUA SANTA 4 0 0 4
CUENCA 3 63 0 66
ELOY ALFARO 0 2 0 2
ESMERALDAS 31 58 0 89
GUAYAQUIL 30 88 0 118
HUAQUILLAS 69 41 2 112
IBARRA 69 79 1 149
LAGO AGRIO 92 111 1 204
LATACUNGA 4 0 0 4
MACHALA 8 8 1 17
MANTA 12 0 0 12
ORELLANA 1 0 0 1
PEDERNALES 1 0 0 1
QUEVEDO 4 0 0 4
QUITO 128 112 1 241
RIOBAMBA 3 0 0 3
SALINAS 1 0 0 1
SAN LORENZO 7 55 0 62
SAN MIGUEL 1 0 0 1
SANTO DOMINGO 2 68 0 70
TULCAN 145 122 6 273

The number of direct and indirect fields per partners
¿Implementacion directa o indirecta?
subPartnerName Directa Indirecta NAs Total
ACNUR
ACNUR 240 44 3 287
ASA 0 2 0 2
Federación de Mujeres de Sucumbios 0 2 0 2
Federación de mujeres de Sucumbíos 0 1 0 1
HIAS 0 664 0 664
JRS Ecuador 0 1 0 1
NRC 0 4 0 4
ADRA
ADRA 13 0 2 15
CARE
CARE 27 0 0 27
Dialogo Diverso
Dialogo Diverso 13 0 0 13
OIM 0 1 0 1
JRS Ecuador
JRS Ecuador 5 0 0 5
Mision Scalabriniana
Mision Scalabriniana 18 0 0 18
NRC
ACNUR 0 4 0 4
NRC 133 0 1 134
OIM
OIM 73 0 0 73
OPS/OMS
OPS/OMS 1 4 1 6
Plan Internacional
Plan Internacional 3 0 0 3
PNUD
PNUD 7 0 0 7
RET
RET 15 0 0 15
UNESCO
UNESCO 3 0 0 3
UNFPA
RET 0 2 0 2
UNFPA 58 0 4 62
UNICEF
ADRA 0 21 0 21
Centro de Desarrollo y Autogestión 0 2 0 2
HIAS 0 37 0 37
NRC 0 3 0 3
UNICEF 4 15 1 20
World Vision
World Vision 5 0 0 5

2.2. Narrative data

In this section, we focus on the subset of records that contain multi-line text fields, called “Narrative data” in ActivityInfo terms. Put simply, narrative data consists of multi-line text fields that allow users to enter long texts.

Note that we also keep the narratives that are empty (which are displayed as NA, Not Available).

The number of narrative records in form topics

formNameRecode Filled responses Missing responses Total Response rate [1]
Tecnico 182 1085 1267 0.168
Poblacion 123 249 372 0.494
VBG 78 430 508 0.181
Coordinacion 43 215 258 0.200
Educacion 43 333 376 0.129
SectorLaboral 29 116 145 0.250
SocialCohesion 28 110 138 0.255
Socios 28 175 203 0.160
Proteccion 25 59 84 0.424
Fronteras 22 44 66 0.500
Habitat 18 84 102 0.214
Alojamiento 12 51 63 0.235
Necesidades 12 228 240 0.053
VBG_SSR 11 27 38 0.407
LGBTI 7 17 24 0.412
Salud 6 124 130 0.048
Trafico 4 8 12 0.500
Agua 1 99 100 0.010
Educacional 1 1 2 1.000
ProteccionInfancia 1 0 1 Inf
AlimentSegur 0 118 118 0.000
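[1] The number of filled responses divided by the number of missing responses. This is a filled-to-missing ratio, not a share of the total; Inf means that no responses are missing.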
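The response rate column can be reproduced from the filled and missing counts. The Python sketch below is illustrative only: response_ratio follows the footnote's definition, while response_share (filled divided by total) is an alternative measure added here for comparison and is not part of the table.

```python
import math


def response_ratio(filled, missing):
    """Filled-to-missing ratio as defined in the table footnote."""
    if missing == 0:
        # Mirrors the Inf shown for ProteccionInfancia (1 filled, 0 missing).
        return math.inf if filled > 0 else math.nan
    return filled / missing


def response_share(filled, missing):
    """Alternative measure (an assumption): filled responses as a share of all."""
    total = filled + missing
    return filled / total if total else math.nan


tecnico_ratio = round(response_ratio(182, 1085), 3)
tecnico_share = round(response_share(182, 1085), 3)
print(tecnico_ratio, tecnico_share, response_ratio(1, 0))
```

The share stays between 0 and 1 even when nothing is missing, which can make form topics easier to compare.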


The table shows the number of filled and missing (empty) narrative fields per form topic.

  • The form Tecnico (Medios de vida y formación técnico-profesional) has 182 filled and 1085 missing responses out of a total of 1267 records.

  • The form VBG (Protección_VBG) has 78 filled and 430 missing responses out of a total of 508 records.

  • The form Educacion (Acceso_a_educación) has 43 filled and 333 missing responses out of a total of 376 records.

  • The form Poblacion (Manejo de la información y entrega directa de la información a la población) has 123 filled and 249 missing responses out of a total of 372 records.

  • The form Coordinacion (Asistencia técnica para gestion de la informacion y coordinacion) has 43 filled and 215 missing responses out of a total of 258 records.

Please note that the following form topics are not included in the table above because they do not contain any narrative fields: Transporte humanitario, Protección general, Protección_Infancia, Protection_Otro, Asistencia técnica para Salud

Partners entering narrative data

Here, we look at the partners entering narrative data. Missing records (NAs) are excluded.

The number of ‘Reporting Partners’ and ‘Implementing Partners’ reporting narrative (multi-line text) data:

As we have seen previously, not all reporting and implementing partners record multi-line narrative text. For instance, the partner PMA has many implementing partners reporting other data types (as seen above), but there are no narratives from them.

  • 71% of the narrative records are entered by the implementing partner HIAS, reporting via ACNUR. Only 27% of the narrative records are entered by ACNUR itself.

  • HIAS also enters 26% of the narrative records via UNICEF.

  • The rest of the “reporting partners” do not have any “implementing partners”; it seems that they do the implementation themselves: CARE, Dialogo Diverso, JRS Ecuador, Mision Scalabriniana, OIM, OPS/OMS, Plan Internacional, PNUD, UNESCO, UNFPA.

The number of cantons and provinces entering narrative data

Canton and provinces
The number of reports in the multi-text (narrative) fields
Canton Frequency Relative frequency (within province) Relative frequency (of all records)
PICHINCHA
QUITO 756 1.000 0.178
CARCHI
TULCAN 689 1.000 0.162
SUCUMBIOS
LAGO AGRIO 617 1.000 0.145
IMBABURA
IBARRA 463 1.000 0.109
GUAYAS
GUAYAQUIL 375 1.000 0.088
ESMERALDAS
ESMERALDAS 269 0.568 0.063
SAN LORENZO 197 0.416 0.046
ELOY ALFARO 8 0.017 0.002
SANTO DOMINGO DE LOS TSACHILAS
SANTO DOMINGO 265 1.000 0.062
EL ORO
HUAQUILLAS 261 0.900 0.061
MACHALA 29 0.100 0.007
AZUAY
CUENCA 235 1.000 0.055
MANABI
MANTA 30 1.000 0.007
COTOPAXI
LATACUNGA 11 1.000 0.003
TUNGURAHUA
BAÑOS DE AGUA SANTA 11 0.611 0.003
AMBATO 7 0.389 0.002
LOS RIOS
QUEVEDO 10 1.000 0.002
BOLIVAR
SAN MIGUEL 7 1.000 0.002
CHIMBORAZO
RIOBAMBA 7 1.000 0.002


Treemap plot showing canton and province reporting frequencies.

3. Analysis

Response quality

Response quality here means how much of a response the questions receive. The idea is to identify factors that affect response quality and to understand under which conditions they do so. In the end, the results may reveal some useful insights about the quality of textual responses in the narrative fields.

Some questions for studying response quality by measuring word count:

  • Is there any relationship between the word counts of the response, question and description fields?

  • How is the response word count distributed across explanatory variables such as the question, form topic, canton name and partner name?

Assumptions:

  • Responses with a larger word count are of higher quality than responses with a smaller word count.

In other words, we assume that the more words, the better. This assumption has limitations, which stem from the unequal distribution of the data. The word count of a response can also depend on other things; for example, when a question calls for a short answer, the responses tend to be shorter.

Additionally, we can cross-check these outcomes. It might be a good idea to take a small subset of the data and ask an expert to test the assumptions qualitatively. For instance, we can take the twenty responses with the highest word count and the twenty responses with the lowest word count. We choose the extremes because they show the greatest differences, which makes the assumptions easier to test.
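Selecting the extremes for such an expert review could look like the Python sketch below. It assumes that the word count is the number of whitespace-separated tokens, which may differ from how the word counts in this report were computed; word_count and extremes are illustrative names.

```python
def word_count(text):
    """Number of whitespace-separated tokens (an assumed definition)."""
    return len(text.split())


def extremes(responses, n=20):
    """Return the n longest and the n shortest responses by word count."""
    ranked = sorted(responses, key=word_count)
    shortest = ranked[:n]
    longest = ranked[-n:][::-1]  # longest first
    return longest, shortest


# Hypothetical toy responses; n=1 keeps the example small.
sample = ["uno dos tres", "uno", "uno dos tres cuatro", "uno dos"]
longest, shortest = extremes(sample, n=1)
print(longest, shortest)
```

With fewer than 2 × n responses the two groups overlap, so a real selection would need at least forty non-missing responses.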

Word count

One issue with the questions is that their names are unique only within a form. The same question name can appear in multiple forms, where it has a different meaning. For instance, the question “Cualitativo” in the form “Salud” implies something different from the question “Cualitativo” in the form “Protección_VBG”.

There are two ways to solve this kind of problem:

  • We can combine the question with its form and folder labels. This gives each question a unique name.

  • Alternatively, we can move the analysis up to the form level. In this document, we do both.
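The first option, combining folder, form and question labels into one key, can be sketched as follows. This is a minimal Python illustration; question_key and the " / " separator are assumptions, not names taken from the repository.

```python
def question_key(folder_name, form_name, question, sep=" / "):
    """Combine folder, form and question labels into one unique name."""
    return sep.join((folder_name, form_name, question))


# The same question label in two different forms yields two different keys.
salud = question_key("Objectivo_1.1", "Salud", "Cualitativo")
vbg = question_key("Objectivo_2.2", "Protección_VBG", "Cualitativo")
print(salud)
print(vbg)
```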

Count of responses per topic/question (note that missing entries NAs are removed):

folderName formName formNameRecode Month question response description partnerName subPartnerName province canton .responseWordCount .questionWordCount .descriptionWordCount
Objectivo_1.1 Salud Salud 2019-02 Cualitativo 1. Entrega de k Descripción de UNFPA UNFPA CARCHI TULCAN 302 1 17
Objectivo_1.1 Salud Salud 2019-02 Cualitativo 1. Entrega de k Descripción de UNFPA UNFPA EL ORO HUAQUILLAS 302 1 17
Objectivo_1.1 Salud Salud 2019-02 Cualitativo 1. Entrega de k Descripción de UNFPA UNFPA EL ORO MACHALA 302 1 17
Objectivo_1.1 Salud Salud 2019-02 Cualitativo 1. Entrega de k Descripción de UNFPA UNFPA SUCUMBIOS LAGO AGRIO 302 1 17
Objectivo_1.1 Salud Salud 2019-04 Cualitativo Se complementa Descripción de UNFPA UNFPA SUCUMBIOS LAGO AGRIO 13 1 17
Objectivo_1.1 Salud Salud 2019-04 Cualitativo 233 Equipos méd Descripción de UNFPA UNFPA ESMERALDAS SAN LORENZO 46 1 17

It is also good practice to check the number of responses per question. For example, a question that has only two responses gives little basis for judging whether its responses are short.


Box plots visualize the spread of the data, showing its variability and dispersion.

In the box plot above, we see a number of things:

  1. each individual black point in a group represents a “response” in the records, and its position indicates the word count;

  2. the left and right borders of the central rectangle (colored red) mark the first and third quartiles respectively, so the width of the rectangle is the interquartile range (IQR);

  3. the line in the middle of the rectangle indicates the median value;

  4. the ends of the lines stretching from the left and right sides of the central rectangle mark the minimum and maximum values, outliers excluded;

  5. the orange points show the outliers.
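The elements listed above can be computed directly from a list of word counts. The Python sketch below assumes the common 1.5 × IQR rule for flagging outliers; the settings used for the plot in this report are not stated, so both the rule and the quantile method are assumptions.

```python
import statistics


def box_stats(values):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low <= v <= high]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),   # smallest value that is not an outlier
        "whisker_high": max(inside),  # largest value that is not an outlier
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical word counts with one very long response.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```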

Some insights from this plot can be: TODO

The standard deviation is a single-number statistic that measures the spread of the data.

Deviation of response word counts per form topic
Measure the spread with standard deviation
formNameRecode SD
Agua NA
Alojamiento 71.15
Coordinacion 59.13
Educacion 56.56
Educacional NA
Fronteras 45.15
Habitat 6.64
LGBTI 123.48
Necesidades 19.37
Poblacion 69.82
Proteccion 21.69
ProteccionInfancia NA
Salud 141.10
SectorLaboral 21.80
SocialCohesion 29.35
Socios 106.79
Tecnico 49.04
Trafico 17.30
VBG 44.27
VBG_SSR 34.22
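The NA entries appear for Agua, Educacional and ProteccionInfancia, which each have a single filled response in the narrative table; a sample standard deviation needs at least two values. A minimal Python sketch of this behavior, with spread as an illustrative name and None standing in for NA:

```python
import statistics


def spread(word_counts):
    """Sample standard deviation, or None when fewer than two values exist."""
    if len(word_counts) < 2:
        return None  # plays the role of NA in the table
    return round(statistics.stdev(word_counts), 2)


single = spread([302])    # one response only: no spread can be computed
pair = spread([13, 46])   # two response word counts from the Salud example rows
print(single, pair)
```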


In the plot above, a box plot of response word counts per form topic based on the raw data, the outliers are shown in orange. Outliers are the points that lie beyond the whiskers, the long lines extending from the box.

The response word count distribution per form topic categorized by partner name:

The response word count distribution per form topic categorized by canton name:



A caveat: reducing multiple values to a single value should be avoided in the early stages of the analysis because the reduction hides a lot, e.g. a bar chart showing the average word count per partner. A partner's average may differ from the others' because:

  1. The partner actually writes longer (or shorter) responses than other partners.

  2. The questions the partner answered call for longer (or shorter) answers.

The Description field

Some questions have the description field giving extra details about the questions.

Do questions with a description field have better response quality than questions without one?

Looking at the table containing form name, question, description and so on:

The plot below shows the response word counts per form, colored by whether or not the response has a description field. A response counts as having a description when the description field contains at least one word.


The responses with the longest word counts are the ones with a description. Nevertheless, no clear correlation between response word count and the presence of a description field is visible. Interestingly, the form topic Protección_VBG has no description fields at all in its questions.

Analysis of Variance

TODO ANOVA

Correlation

The correlation between the word counts of the different fields in ActivityInfo:

                       .responseWordCount  .questionWordCount  .descriptionWordCount
.responseWordCount                   1.00               -0.14                   0.13
.questionWordCount                  -0.14                1.00                  -0.59
.descriptionWordCount                0.13               -0.59                   1.00
  • 0.15% of the responses consist of a single word, and those responses are not meaningful (they are test entries reading TEST).
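Each cell of a correlation matrix like the one above is a pairwise correlation coefficient. The Python sketch below computes the Pearson coefficient by hand; whether the report used Pearson or another method is not stated, so this is an assumption, and the input values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical word counts: responses grow as questions shrink.
question_wc = [1, 2, 3, 4]
response_wc = [40, 30, 20, 10]
r = pearson(question_wc, response_wc)
print(round(r, 2))
```

The function is undefined when one of the sequences is constant, because the denominator is then zero.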

TODO

The regression lines

We can look at multiple continuous variables in our data.

  • word count of response field: response.wc

  • word count of question field: question.wc

  • word count of description field: description.wc

Given this data, the model formula showing dependent and independent variables can be as follows:

$ response.wc \sim question.wc $

$ response.wc \sim question.wc + description.wc $

Here the word count of the response field is the dependent variable, and the word counts of the question and description fields are the independent variables in the regression.

We expect higher word counts in the question and description fields to have a positive effect on the word count of the response field.

Scatter plots help us understand the characteristics of those variables. However, they do not show the general trend, which is what the trend line adds.

The gray area around the lines shows the confidence band at the 0.95 level. Although the linear regression line has a visible slope, we cannot say that the trend is robust, because the confidence band, which represents the uncertainty in the estimate, is wide.
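The trend line for the single-predictor model (response word count explained by question word count) is an ordinary least-squares fit. The Python sketch below illustrates that fit under stated assumptions: fit_line is a hypothetical helper, the data are made up, and the confidence band is not computed.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
question_wc = [1, 2, 3, 4]
response_wc = [3, 5, 7, 9]
intercept, slope = fit_line(question_wc, response_wc)
print(intercept, slope)
```

A positive slope would support the expectation stated above, but the width of the confidence band decides how much weight the estimate can bear.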

Logistic regression

TODO In fact, it is called the binomial logistic regression. When one of the independent variables is dichotomous (having two categories), ...

Hypothesis testing

We do the hypothesis testing based on the assumptions we have.

$ H_0 $ : The word count of a question has no effect on the word count of its responses.

$ H_1 $ : A higher word count in questions results in a higher word count in responses.

TODO

4. Text mining & analysis

In this section, we treat text as data.

4.1. Textual data preparation

This section describes how textual data is prepared and which steps are commonly performed.

There are usually five steps involved in this process:

1. Tokenization

Tokenization means splitting a text into tokens, which are considered meaningful units of text. A token is often a single word, but it can also be a group of words (such as a bigram) or even a sentence, depending on the level of analysis.

Stemming, which brings words (nouns and verbs) back to their base or infinitive forms, is applied to the tokens afterwards (see step 5) so that we can get the essence of the words.
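Word and sentence tokens can be sketched with regular expressions. This Python example is only an illustration and not the tokenizer used in this report; the sample text is hypothetical, and matching on word characters drops punctuation as a side effect.

```python
import re


def word_tokens(text):
    """Split a text into word tokens; punctuation is dropped as a side effect."""
    return re.findall(r"\w+", text)


def sentence_tokens(text):
    """Split a text into sentences at ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical response text.
text = "Entrega de kits de salud sexual y reproductiva. Se complementa la entrega."
words = word_tokens(text)
sentences = sentence_tokens(text)
print(words)
print(sentences)
```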

2. Strip punctuation

Punctuation is often not required in text analysis (unless a researcher wants to tokenize the text on a specific delimiter, for example into sentence tokens); otherwise, it only creates noise.

3. Convert text into lowercase

Once the text is converted to lowercase, the words respuesta and Respuesta, for instance, are no longer treated as different words.

4. Exclude stopwords & numbers

Stop words are the most common words in a language, which bring no significant results in the analysis. They are spread all over the text and are not meaningful on their own. Stop words include articles (el/la), conjunctions (y), pronouns (yo/tú/etc.) and so on.

In text mining, this step is usually done after the text has been converted to lowercase, so that one does not have to provide both lowercase and sentence-case versions of the stop words.

We imported a list of Spanish stop words (source here; it is an alternative to the stopwords_es list from the corpus package) and performed a filtering join that returns only the tokens that are not listed in the stop words.

The response tokens originally had 38216 rows. After the stop words were removed, the number of rows decreased to 19429, so about 51% of the tokens remain (a decrease of about 49%).

It is also possible to add custom words to the list, e.g. ACNUR, if organization names are not desired in the results, or violencia, if a word is overused and brings no further explanation.
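The filtering step and the change in row counts can be illustrated as follows. This Python sketch is not the report's code: STOPWORDS contains only the examples mentioned above rather than the imported list, and remove_stopwords is a hypothetical helper that also drops purely numeric tokens.

```python
# Only the stop words given as examples in the text; the real list is longer.
STOPWORDS = {"el", "la", "y", "yo", "tú"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep tokens that are neither stop words nor purely numeric."""
    return [t for t in tokens if t.lower() not in stopwords and not t.isdigit()]


kept = remove_stopwords(["la", "entrega", "y", "kits", "34"])

# Row counts reported in the text, before and after removing stop words.
before, after = 38216, 19429
remaining_pct = round(100 * after / before)
decrease_pct = round(100 * (before - after) / before)
print(kept, remaining_pct, decrease_pct)
```

Adding custom words such as ACNUR only requires extending the set passed as stopwords.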

5. Perform stemming

Stemming is a process that removes the suffixes (and sometimes prefixes) of words and brings them to their base form. We use the “Hunspell” stemmer from the hunspell package, which provides more precise stemming behavior.

From this point onwards, we use stemmed words instead of the raw tokens because stemmed words give us better information.

After stemming, the words look like this:

word word_stem
entrega entregar
kits kits
salud salud
sexual sexual
reproductiva productivo
... ...



4.2. Sentiment analysis

Sentiment analysis (also called opinion mining) is a technique for understanding the emotional meaning of a text with the help of a dictionary in which humans have already labeled words as positive or negative.

The responses seem to be written in a formal tone; therefore, they may not show any sentiment at all.

First, we find a sentiment lexicon for the Spanish language (source here).

A wordcloud showing positive and negative words in the responses:

4.3. The key themes of each response (by using term-frequency and tf-idf)

TODO

4.4. The relationship between words (n-grams)

Higher order n-grams

We can search for multi-word terms in the responses.

The end result looks like this:

word1       word2        bigram
baterias    sanitarias   baterias sanitario
camas       literas      cama litera
plaza       metálicas    plaza metálico
metálicas   colchones    metálico colchón
colchones   34           colchón 34
...         ...          ...

tf-idf values can also be calculated for bigrams, and visualized within each reporting/implementing partner, province/canton and so forth.
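Bigrams are built by pairing each token with the token that follows it. The Python sketch below is illustrative: bigrams is a hypothetical helper, and the tokens are taken from the unstemmed word1 and word2 columns of the table above.

```python
from collections import Counter


def bigrams(tokens):
    """Pair every token with its successor."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Consecutive tokens as they appear in the word1/word2 columns.
tokens = ["plaza", "metálicas", "colchones", "34"]
pairs = bigrams(tokens)

# Counting the pairs gives the bigram frequencies used for further analysis.
bigram_counts = Counter(pairs)
print(pairs)
```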

4.5. Term-frequency matrix

TODO

4.6. The response structure and the sentence lengths

TODO

5. References

 
The QualMiner project explores the qualitative data used for Venezuelan refugee response by applying text analysis & mining techniques. The project is funded by the UNHCR Innovation Fund. This document last modified on: